(54) Web-based platform for interactive voice response (IVR)



(57) A platform for implementing interactive voice response
(IVR) applications over the Internet or other type of network
includes a speech synthesizer, a grammar generator and a
speech recognizer. The speech synthesizer generates speech
which characterizes the structure and content of a web page
retrieved over the network. The speech is delivered to a user
via a telephone or other type of audio interface device. The
grammar generator utilizes textual information parsed from
the retrieved web page to produce a grammar. The grammar is
supplied to the speech recognizer and used to interpret voice
commands and other speech input generated by the user. The
platform may also include a voice processor which determines
which of a number of predefined models best characterizes a
given retrieved page, such that the process of generating an
appropriate verbal description of the page is considerably
simplified. The speech synthesizer, grammar generator, speech
recognizer and other elements of the IVR platform may be
operated by an Internet Service Provider (ISP), thereby
allowing the general Internet population to create interactive
voice response applications without acquiring their own IVR
equipment.

Description
Field of the Invention

[0001] The present invention relates generally to the
Internet and other computer networks, and more particularly
to techniques for obtaining information over such networks
via a telephone or other audio interface device.

Background of the Invention 

[0002] The continued growth of the Internet has made it a
primary source of information on a wide variety of topics.
Access to the Internet and other types of computer networks
is typically accomplished via a computer equipped with a
browser program. The browser program provides a graphical
user interface which allows a user to request information
from servers accessible over the network, and to view and
otherwise process the information so obtained. Techniques for
extending Internet access to users equipped with only a
telephone or other similar audio interface device have been
developed, and are described in, for example, D.L. Atkins et
al., "Integrated Web and Telephone Service Creation," Bell
Labs Technical Journal, pp. [...], and J.C. Ramming, "PML: A
Language Interface to Networked Voice Response Units,"
Workshop on Internet Programming Languages, ICCL '98, Loyola
University, Chicago, Illinois, May 1998, both of which are
incorporated by reference herein.

[0003] Users developing Interactive Voice Response (IVR)
applications to make use of the audio interface techniques
described in the above references generally require costly
special-purpose IVR hardware, often costing more than
$50,000. The expense associated with this special-purpose IVR
hardware prevents many users, such as small business owners
and individuals, from building IVR applications for their web
pages. Such users are therefore unable to configure their web
pages so as to facilitate access by telephone or other audio
interface device.

Summary of the Invention

[0004] The present invention provides apparatus and methods
for implementing Interactive Voice Response (IVR)
applications over the Internet or other computer network. An
illustrative embodiment of the invention is an IVR platform
which includes a speech synthesizer, a grammar generator and
a speech recognizer. The speech synthesizer generates speech
which characterizes the structure and content of a web page
retrieved over the network. The speech is delivered to a user
via a telephone or other type of audio interface device. The
grammar generator utilizes textual information parsed from
the retrieved web page to produce a grammar. The grammar is
then supplied to the speech recognizer and used to interpret
voice commands generated by the user. The grammar may also be
utilized by the speech synthesizer to create phonetic
information, such that similar phonemes are used in both the
speech recognizer and the speech synthesizer. In appropriate
applications, such as name dialing directories and other
applications having grammars with long compilation times, the
grammar produced by the grammar generator may be partially or
completely precompiled.

[0005] An IVR platform in accordance with the invention may
also include other elements, such as, for example, a parser
which identifies textual information in the retrieved web
page and delivers the textual information to the grammar
generator, and a voice processor which also receives web page
information from the parser. The voice processor uses this
information to determine which of a number of predefined
models best characterizes a given retrieved web page. The
models are selected to characterize various types and
arrangements of structure in the web page, such as section
headings, tables, frames, forms and the like, so as to
simplify the generation of a corresponding verbal
description.

[0006] In accordance with another aspect of the invention,
the speech synthesizer, grammar generator and speech
recognizer, as well as other elements of the IVR platform,
may be used to implement a dialog system in which a dialog is
conducted with the user in order to control the output of the
web page information to the user. A given retrieved web page
may include, for example, text to be read to the user by the
speech synthesizer, a program script for execution by the
processor, and a hyperlink for each of a set of designated
spoken responses which may be received from the user. The web
page may also include one or more hyperlinks that are to be
utilized when the speech recognizer rejects a given spoken
user input as unrecognizable.
[0007] An IVR platform in accordance with the invention may
be operated by an Internet Service Provider (ISP) or other
type of service provider. By permitting dialog-based IVR
applications to be built by programming web pages, the
invention opens up a new class of Internet applications to
the general Internet population. For example, Internet
content developers are not required to own or directly
operate an IVR platform if they have access to an IVR
platform from an ISP. This is a drastic departure from
conventional approaches to providing IVR service, which
typically require the ownership of expensive IVR equipment.
An ISP with an IVR platform system will be able to sell IVR
support services to the general public at relatively low
cost.

Brief Description of the Drawings

[0008] FIG. 1 is a block diagram of a system including a
web-based interactive voice response (IVR) platform in
accordance with the invention.
[0009] FIG. 2 shows a more detailed view of the web-based
IVR platform of FIG. 1.

Detailed Description of the Invention 

[0010] The present invention will be illustrated below in
conjunction with an exemplary system. It should be
understood, however, that the invention is not limited to use
with any particular type of system, network, network
communication protocol or configuration. The term "web page"
as used herein is intended to include a single web page, a
set of web pages, a web site, and any other type or
arrangement of information accessible over the World Wide
Web, over other portions of the Internet, or over other types
of communication networks. The term "platform" as used herein
is intended to include any type of computer-based system or
other type of system which includes hardware and/or software
elements configured to provide one or more of the interactive
voice response functions described herein.

1. SYSTEM DESCRIPTION 

[0011] FIG. 1 shows an exemplary information retrieval
system 100 in accordance with an illustrative embodiment of
the invention. The system 100 includes a web-based IVR
platform 102, a network 104, a number of servers 106-i,
i = 1, 2, ... N, and an audio interface device 108. The
network 104 may represent the Internet, an intranet, a local
area or wide area network, a cable network, a satellite
network, as well as combinations or portions of these and
other networks. Communications between the IVR platform 102
and one or more of the servers 106-i may be via connections
established over the network 104 in a conventional manner
using the Transmission Control Protocol/Internet Protocol
(TCP/IP) standard or other suitable communication
protocol(s). The servers 106-i may each represent a computer
or group of computers arranged in a conventional manner to
process information requests received over network 104. The
audio interface device 108 may be, for example, a telephone,
a television set-top box, a computer equipped with telephony
features, or any other device capable of receiving and/or
transmitting audio information. The audio interface device
108 communicates with the IVR platform 102 via a network 109
which may be, for example, a public switched telephone
network (PSTN), a cellular telephone network or other type of
wireless network, a data network such as the Internet, or
various combinations or portions of these or other networks.
Although shown as separate networks in the illustrative
embodiment of FIG. 1, the networks 104 and 109 may be the
same network, or different portions of the same network, in
alternative embodiments.
[0012] FIG. 2 shows the IVR platform 102 in greater detail.
The IVR platform 102 includes a web browser 110 which is
operative to retrieve web pages or other information from one
or more of the servers 106-i via network 104. The web browser
110 may be a conventional commercially-available web browser,
or a special-purpose browser designed for use with audio
interface device 108. For example, the web browser 110 may
support only a subset of the typical web browser functions
since in the illustrative embodiment it does not need to
display any visual information, i.e., it does not need to
process any image or video data. The browser 110 retrieves
text, audio and other information from one or more of the
servers 106-i via the network 104. The browser 110 may be
configured to play back the retrieved audio in a conventional
manner, such that the playback audio is supplied to the audio
interface device 108 via the network 109. The browser 110
delivers the retrieved text and other information to an HTML
parser 112. The parser 112 performs preprocessing operations
which configure the retrieved text so as to facilitate
subsequent interpretation by a voice processor 114 and a
grammar generator 120. The retrieved text is assumed in the
illustrative embodiment to be in a Hypertext Markup Language
(HTML) format, but could be in other suitable format(s) in
other embodiments. For example, the IVR platform 102 may also
be configured to process web page information in a Phone
Markup Language (PML) format. PML is a language specifically
designed to build telephone-based control into HTML pages,
and including PML capability in the IVR platform allows it to
better support a wide variety of web-based IVR applications.

[0013] The voice processor 114 performs analysis of the text
and other web page information supplied by the HTML parser
112, and generates corresponding verbal descriptions which
are supplied to a text-to-speech (TTS) synthesizer 116. The
HTML parser 112, voice processor 114 and TTS synthesizer 116
transform the text and other web page information into speech
which is delivered to the audio interface device 108 via the
network 109. The grammar generator 120 utilizes the text and
other web page information received from the HTML parser 112
to produce one or more speech recognition grammars which are
delivered to a speech recognizer 122. The speech recognizer
122 receives speech input generated by the audio interface
device 108, and utilizes the grammar produced by grammar
generator 120 to recognize words in the speech. Appropriate
indicators of the recognized words are then supplied to the
spoken command interpreter 124, which interprets the
indicators to generate corresponding command signals. The
command signals are supplied to a processor 130 which
controls the operation of at least a portion of the IVR
platform 102. The IVR platform 102 further includes a
dual-tone multi-frequency (DTMF) decoder 126 which decodes
DTMF signals received in platform 102 from the audio
interface device 108 via the network 109. Such signals may be
generated, for example, in response to selections offered in
the audio playback or speech supplied from IVR platform 102
to the audio interface device 108. The decoded DTMF
information is supplied from the decoder 126 to the processor
130.

[0014] The processor 130 interacts with a memory 132 and
with the web browser 110. The processor 130 may be a
microprocessor, central processing unit, application-specific
integrated circuit (ASIC) or any other digital data processor
which directs the operation of at least a portion of the IVR
platform 102. For example, the processor 130 may be a
processor in a computer which implements the web browser 110
or one or more of the other elements of the IVR platform 102.
The memory 132 may represent an electronic memory, a magnetic
memory, an optical memory or any other memory associated with
the IVR platform 102, as well as portions or combinations of
these and other memories. For example, memory 132 may be an
electronic memory of a computer which, as noted above, may
also include processor 130. In other embodiments, the IVR
platform 102 may be implemented using several interconnected
computers as well as other arrangements of suitable
processing devices.

[0015] The TTS synthesizer 116, speech recognizer 122, spoken
command interpreter 124, DTMF decoder 126, processor 130 and
memory 132, as well as other elements of IVR platform 102,
may be elements of a conventional platform such as the
Intuity/Conversant system or Lucent Speech Processing System
(LSPS), both from Lucent Technologies Inc. of Murray Hill,
New Jersey. As previously noted, the IVR platform 102 may be
implemented using one or more personal computers equipped
with commercially available speech and telephony system
boards. It should be noted that the dotted line connections
between platform 102 and audio interface device 108 in FIG. 2
may represent, e.g., a single connection established through
the network 109, such as a telephone line connection
established through a PSTN or a cellular or other type of
wireless network.
[0016] The IVR platform 102 in an illustrative embodiment
may be configured to respond to either voice commands or DTMF
signals in one of the following three modes: (1) DTMF only,
in which descriptions include phrases to associate, e.g.,
button numbers on audio interface device 108 with information
available via a retrieved web page; (2) voice only, where a
description of a retrieved web page is given in the form of
speech generated by TTS synthesizer 116; and (3) both DTMF
and voice, where both speech description and phrases
identifying button numbers and the like may be given. The
DTMF only mode may be desirable when operating audio
interface device 108 in a noisy environment, such as a busy
city street or in a crowd of people, because background noise
might be interpreted as voice commands by IVR platform 102.
The voice only mode is often most desirable, because it tends
to produce the most rapid page descriptions.

[0017] The voice processor 114 in IVR platform 102 takes the
output from the HTML parser 112 and further analyzes the
corresponding retrieved HTML web page to identify structure
such as, for example, section headings, tables, frames, and
forms. The voice processor 114 in conjunction with TTS
synthesizer 116 then generates a corresponding verbal
description of the page. In general, such a verbal
description may include speech output corresponding to the
page text, along with descriptions of sizes, locations and
possibly other information about images and other items on
the page.
[0018] Depending on the preference of the user, the page can
be described by content or by structure. For example, a user
may be permitted to choose either a description mode or an
inspection mode. In an example of the description mode, the
IVR platform 102 will immediately start to describe a new web
page upon retrieval, using the various TTS voices to indicate
various special elements of the page. The user can command
IVR platform 102 to pause, back up, skip ahead, etc., in a
manner similar to controlling an audio tape player, except
that content elements such as sentences and paragraphs can be
skipped.
[0019] In an example of the inspection mode, IVR platform
102 will briefly describe the structure of the page and wait
for spoken inspection commands. Inspection commands allow the
user to "descend into" elements of the page to obtain greater
detail than might normally be obtained in the description
mode. For example, each element of a table can be inspected
individually. If a given table element also has structure,
the user can descend into this structure recursively. The
inspection mode uses appropriate dialog to provide the user
with flexibility in controlling the way information is
delivered. The user may be given control over the TTS
speaking rate, and the ability to assign various TTS voices
to certain HTML element types such as section headings,
hyperlink titles, etc. In addition, section headings may be
rendered in a different voice from ordinary text. If section
headings are detected, initially only the headings will be
described to the user. Voice commands can then be used to
instruct IVR platform 102 to move to a particular section,
i.e., the user can speak the section title to instruct IVR
platform 102 to move to that section.
[0020] The above-noted tables may be used for page layout
only or may be true tabulations. The page analysis process
implemented in HTML parser 112 and voice processor 114
determines which is most likely and generates descriptions
accordingly. True tabulations are described as tables. Tables
used for page layout [...] element locations may be described
if deemed important. An inspection mode can be used to
override this table [...] descriptions. Frames can also be
handled in a number of ways, including a full page
description method and a frame focus method. The full page
description method merges the information from all frames
into a single context [...] independently of the frames. The
frame focus method allows the user to specify a frame to be
described or inspected, such that voice commands are focused
on that frame. Forms may be described, for example, in terms
of field title labels, with the fields addressable by
speaking field titles. In addition, general items can be
entered into form fields by spelling, and the above-described
inspection mode can be used to obtain menu choices.

[0021] The grammar generator 120 in IVR platform 
102 automatically generates speech recognition gram- 
mar and vocabulary from the HTML of a retrieved web 
page. This is an important feature of IVR platform 102 
that makes it useful for building IVR applications. The 
parsed HTML is analyzed in grammar generator 120 for 
section titles, hyperlinks and other indicators that are to
be converted to speech. Grammar generator 120 then con-
structs a subgrammar for each indicator by generating
all possible ways of speaking subsets of the indicator.
All other voice commands are then combined with the
subgrammar and a complete grammar is compiled into 
an optimized finite-state network. This network is loaded 
into the speech recognizer 122 to constrain the possible 
sequences of words that can be recognized. Other types 
of grammar generation could also be used in conjunc- 
tion with the invention. 
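
The subgrammar construction described in paragraph [0021] can be illustrated with a minimal Python sketch. It assumes that "subsets of the indicator" means contiguous runs of words in a title, which the text does not state explicitly, and it models the grammar as a plain set of phrases rather than a compiled finite-state network. The function names `subphrases` and `build_grammar` are hypothetical.

```python
def subphrases(title):
    """Return every contiguous run of words in `title`, longest first.

    Assumption: a "subset" of an indicator is a contiguous word sequence.
    """
    words = title.lower().split()
    n = len(words)
    runs = {
        " ".join(words[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
    }
    # Longest phrases first, ties broken alphabetically.
    return sorted(runs, key=lambda p: (-len(p.split()), p))


def build_grammar(indicators, commands):
    """Combine fixed voice commands with the sub-phrases of each indicator."""
    grammar = {c.lower() for c in commands}
    for title in indicators:
        grammar.update(subphrases(title))
    return grammar
```

For a three-word title this yields six phrases, so a real implementation would presumably share structure between phrases when compiling the network instead of enumerating them.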

[0022] A byproduct of the illustrative grammar gener- 
ation process implemented in grammar generator 120 
is the creation of a list of vocabulary words. This list may 
be partially processed by the TTS synthesizer 116 to 
create a list of phonetic transcriptions in symbolic form. 
The same phonemes may be used in both the speech 
recognizer 122 and the TTS synthesizer 116. The sym- 
bolic phonetic descriptions, once loaded into the recog- 
nizer 122, tell the recognizer how the vocabulary words 
are pronounced, thus making it possible for the IVR plat- 
form 102 to recognize virtually any spoken word. 
[0023] In normal operation, the IVR platform 102 de- 
scribes retrieved web pages to the user via the speech 
output of the TTS synthesizer 116. The user controls the 
IVR platform 102 by speaking over the TTS synthesizer 
output, thus "barging in." Echo cancellation may be used 
to remove TTS synthesizer output from the speech rec- 
ognition input so that speech recognition will be unaf- 
fected by the TTS output. When the user speaks for a 
sufficiently long period, the TTS output may be interrupt- 
ed, such that speech recognition can be more effectively 
performed, and the speech recognizer output is inter- 
preted into an IVR platform command. 
[0024] As part of the grammar generation process, 
voice command interpretation tables may be estab- 
lished for use later in the interpretation phase. For ex- 
ample, a stored table of possible command phrases 
may be used to associate computer instructions with 
each phrase. Typically, no ambiguous browser com- 
mand phrases are defined. In the case of processing a 
hyperlink, the Uniform Resource Locator (URL) of the
hyperlink is associated with all possible subsets of the
hyperlink title. Section titles can be handled in a similar
manner. Subsequently, when a title word is spo-
ken, the associated URL(s) can be retrieved.



[0025] It is possible that more than one URL and/or browser
command will be retrieved when the spoken title words are not
unique. In such a case, a simple dialog may be initiated such
that the user is given a choice of full title descriptions
that can be selected either by spoken number or by speaking
an unambiguous title phrase. If the phrase is still
ambiguous, a new and possibly smaller list of choices may be
given. The user can back up at any time if the selection
process has not yielded the desired choices. This allows the
user to refine the list and converge on one choice.
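
A minimal Python sketch of the interpretation table of paragraphs [0024] and [0025] follows. It assumes that title subsets are contiguous word runs and that the dialog step can be represented by returning the list of candidate titles. The names `build_table` and `interpret`, the result tuples and the example URLs are hypothetical.

```python
from collections import defaultdict


def build_table(links):
    """Map each contiguous sub-phrase of a link title to the URLs it may mean.

    `links` maps a full hyperlink title to its URL.
    """
    table = defaultdict(set)
    for title, url in links.items():
        words = title.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                table[" ".join(words[i:j])].add(url)
    return dict(table)


def interpret(table, links, spoken):
    """Return ("go", url), ("choose", titles) or ("reject", None)."""
    urls = table.get(spoken.lower().strip(), set())
    if not urls:
        return ("reject", None)
    if len(urls) == 1:
        return ("go", next(iter(urls)))
    # Ambiguous phrase: offer the full titles so the user can refine.
    choices = sorted(t for t, u in links.items() if u in urls)
    return ("choose", choices)
```

With the titles "Local Weather" and "Local News", the spoken word "weather" resolves to a single URL, while "local" returns both titles as choices for a follow-up dialog turn.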

2. PROCESSING DETAILS 

[0026] Various aspects of the voice processing and
other operations performed in the IVR platform 102 of 
FIG. 2 will be described in greater detail below. 

2.1 HTML Parsing 

[0027] As noted above, the HTML parser 112 parses HTML in
retrieved web pages for the purposes of facilitating
production of speech output and generation of grammar. The
HTML parsing process is purposely kept relatively simple.
Full context-free parsing is not required and may even be
undesirable because, while HTML is typically well structured,
many real-world HTML pages include software bugs and other
errors. Therefore, relying on the HTML standard and enforcing
a strict context-free parsing will often be
counterproductive.

[0028] Proper generation of speech output requires an
explicit representation of the structure of a given web page.
The HTML parsing process is used to obtain a representation
of this structure. Important elements such as frames, tables
and forms are identified and their scope within their
containing elements is analyzed. For example, a form can be
contained in a table, which in turn can be contained in a
frame, etc. A critical part of this analysis is to determine
the structural significance of these elements as opposed to
their graphical significance. For example, several levels of
tables may be used in a web page for the sole purpose of
alignment and/or generating attractive graphics around
various elements. In such a case, the entire set of tables
may be structurally equivalent to a simple list. Proper voice
rendering in this case requires that the tables be ignored
and only the bottom-level elements be spoken, i.e., described
to the user as a list. In the case of a "real" data table,
the table would instead be described as such.

[0029] The parsing process itself presents two significant
problems which are addressed below. The first is that various
relationships must be derived from the HTML and explicitly
represented, whereas a normal browser replaces the explicit
representation with a rendered page image. Thus, the
representation must explicitly know, e.g., which words are
bold, italic and part of a link title, as opposed, e.g., to
those that are italic and part of an H3 title. Any particular
combination could have significance in showing relevant
structure. This problem is addressed in the HTML parser 112
by "rendering" the page into data structures. Each string of
text with uniform attributes has an attribute descriptor that
specifies all the features, e.g., bold, link text, heading
level, etc., currently active in that string. This does not
itself provide a hierarchical structure. However, such a
structure, although generally not necessary at the HTML
source level, can be generated by examining the tag
organization.
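
The "rendering into data structures" idea of paragraph [0029] can be sketched in Python using the standard `html.parser` module. This is an illustration under stated assumptions, not the parser 112 itself: the set of tracked tags, the feature names and the names `RunCollector` and `text_runs` are hypothetical, and the attribute descriptor is modeled as a frozenset of active features.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tags to the features tracked in a descriptor.
TRACKED = {"b": "bold", "i": "italic", "a": "link",
           "h1": "h1", "h2": "h2", "h3": "h3"}


class RunCollector(HTMLParser):
    """Collect strings of text together with the features active in each."""

    def __init__(self):
        super().__init__()
        self.active = []   # currently open tracked features
        self.runs = []     # list of (text, frozenset_of_features)

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            self.active.append(TRACKED[tag])

    def handle_endtag(self, tag):
        feature = TRACKED.get(tag)
        if feature in self.active:
            # Tolerate mis-nested markup: drop the most recent matching entry.
            idx = len(self.active) - 1 - self.active[::-1].index(feature)
            del self.active[idx]

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.runs.append((text, frozenset(self.active)))


def text_runs(html):
    """Return the (text, descriptor) runs for an HTML fragment."""
    collector = RunCollector()
    collector.feed(html)
    collector.close()
    return collector.runs
```

Each run carries the full set of active features, so an italic word inside a link title and an italic word inside an H3 heading receive different descriptors even though both are italic.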

[0030] The second parsing problem is that HTML pag- 
es often include errors. This means that a document that 
appears well-structured on the screen may be poorly 
structured at the source level. The HTML parser 112 
must analyze the improperly structured source and de- 
termine a well-formed structure that is equivalent to 
what the user would see on the screen. This can be 
tricky in some common cases, such as a missing <TD> 
within a table, which can cause a conventional browser 
to discard the element. This is particularly troublesome 
for cases involving form elements. This problem should
become less significant as automated tools become
more widely used. However, such tools are also likely
to lead to a proliferation of excess HTML, e.g., multi-
level tables used for layout.
[0031] As previously noted, the grammar generation
process requires extracting the hyperlink titles, and sav-
ing the URLs from the page. Any so-called alternative
or "ALT" fields, intended for use with browsers which
have no image capabilities, may also be extracted as
part of this process. In addition, certain other text such
as section headings can be included in the speech
grammars. The parsing operations required to do this
extraction can be implemented using conventional reg-
ular expression parsing.
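
The regular-expression extraction mentioned in paragraph [0031] might look like the following Python sketch. The patterns are deliberately loose, in keeping with the tolerant parsing described above, and are assumptions made for illustration; the function names `clean` and `extract` are hypothetical, and the ALT pattern handles only double-quoted values.

```python
import re

# Hyperlink: captures the href value and the enclosed title markup.
LINK_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*["\']?([^"\'\s>]+)["\']?[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)
# ALT text of images (double-quoted values only in this sketch).
ALT_RE = re.compile(r'<img\b[^>]*?\balt\s*=\s*"([^"]*)"', re.IGNORECASE)
# Section headings H1 through H6.
HEADING_RE = re.compile(r'<h([1-6])\b[^>]*>(.*?)</h\1>',
                        re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')


def clean(fragment):
    """Strip nested tags and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", fragment).split())


def extract(html):
    """Return hyperlink (title, URL) pairs, ALT texts and heading texts."""
    links = []
    for url, title in LINK_RE.findall(html):
        text = clean(title)
        if text:  # skip links whose only content is an image
            links.append((text, url))
    alts = ALT_RE.findall(html)
    headings = [clean(text) for _level, text in HEADING_RE.findall(html)]
    return {"links": links, "alts": alts, "headings": headings}
```

The extracted titles and headings are the text from which speech grammars would be built, while the saved URLs are what a recognized title is later resolved to.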



2.2 Verbal Rendering

[0032] The web page description generated in the IVR
platform 102 is referred to herein as a verbal rendering of
the web page. In an illustrative embodiment, the user can be
permitted to decide whether to have automatic presentation of
the title of the page. If the user has selected automatic
presentation of the page title, the title will be stated to
the user. The verbal rendering will then continue with either
a description of the page content or a description of the
page structure, again depending, e.g., on
previously-established user preferences. Typically, the
simpler of these two approaches is the structural page
description.

[0033] As previously noted, two modes of page description
operation may be provided: a description mode and an
inspection mode. In the description mode, the IVR platform
will continue to render the page until instructed otherwise
or the description is complete. The inspection mode gives the
user the initiative such that the user can ask questions and
get specific answers. Utilizing the inspection mode allows
the user to descend recursively into structural elements of
the page. The user can switch between the description and
inspection modes under voice control.

2.2.1 Structure Description



[0034] The page structure is generally described in terms of
the placement of elements like images, tables and forms. In
the inspection mode, the user will typically get a top-down
description with options to open various elements. Consider
as an example a simple web page made of three frames: a
title/information frame across the top, an index bar down the
side, and a main page. A top-level description of this page
might be "a title frame, index frame and a page." In this
case, the user could specify a focus to one of the three
areas for further description. During navigation, links in
the title and/or index frames would be available at all times
or only on request, based on user preference. Certain other
common features, such as a single-entry search form, may also
be described as a top-level layout item even if not in a
separate frame. If the page contains a search form, the page
could be described as "a title frame, index frame and a page
with a search form."

[0035] Description of the main page may be based on apparent
structure. For example, if there are five section entries,
i.e., <H1> entries, on the page, then the description would
be "a page with five sections." The section headers, i.e.,
<H1> contents, plus "top of page" would be available for
speaking to jump to that section. If the user says nothing,
then the system either waits or starts with the first
section, based on user preference. Note that other entities
can be the basis for section breakdown. For example, a page
with several lists, each preceded by a short paragraph of
plain text, could be broken down into one section per list,
with the apparent heading paragraph being spoken to the user.
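
The section-based page description of paragraph [0035] can be sketched as follows in Python. The sketch assumes that sections are delimited only by <H1> entries and reports the count as a digit; the function name `describe_page` and the exact wording of the descriptions are hypothetical.

```python
import re

H1_RE = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def describe_page(html):
    """Return (description, jump_targets) based on the <H1> entries."""
    titles = [" ".join(TAG_RE.sub(" ", t).split())
              for t in H1_RE.findall(html)]
    count = len(titles)
    if count == 0:
        description = "a page with no section headings"
    elif count == 1:
        description = "a page with 1 section"
    else:
        description = "a page with %d sections" % count
    # Phrases the user may speak to jump within the page.
    targets = ["top of page"] + [t.lower() for t in titles]
    return description, targets
```

The returned jump targets would be added to the speech grammar so that speaking a section header moves the description to that section.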
[0036] Description of a section may also be done based on
apparent structure. If the section is plain text, then the
number of paragraphs is announced and speaking begins, with
navigation between paragraphs supported. Subsection breakdown
can be performed in a similar manner, based on the presence
of lower-level headers or bold lines that appear to be used
as section headers. This subsection analysis will probably
not go past this second level as the user may be unable to
keep track of position with more levels. All other
information could be read sequentially.
[0037] If the page includes a table, a determination is made
as to its purpose. Examples of different purposes include
graphics, alignment, or data. Graphics indicates [...] or
border, and such tables are ignored. The difference between
alignment and data is that in an alignment table the contents
are inherently one-dimensional, whereas in a data table the
contents are arranged in a two-dimensional array. The
contents of an alignment table are either treated as a list
or ignored, based on whether significant alignment would be
apparent to a viewer. A data table is described as such, with
the number of rows and columns announced, and an attempt made
to locate the row and column headers. Navigation based on the
two-dimensional structure is available.
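
The three table purposes of paragraph [0037] can be illustrated with a simple Python heuristic. The text does not specify how the determination is made, so the rule used here is purely an assumption: a table with no text is treated as graphics, a table whose text occupies a single row or a single column is treated as alignment, and anything else is treated as data. The function names are hypothetical.

```python
def classify_table(rows):
    """Classify a table given as a list of rows of cell strings.

    Assumed heuristic: no text -> "graphics"; text confined to one row or
    one column -> "alignment"; text spread over two dimensions -> "data".
    """
    filled = [(r, c)
              for r, row in enumerate(rows)
              for c, cell in enumerate(row)
              if cell.strip()]
    if not filled:
        return "graphics"
    used_rows = {r for r, _ in filled}
    used_cols = {c for _, c in filled}
    if len(used_rows) == 1 or len(used_cols) == 1:
        return "alignment"
    return "data"


def describe_table(rows):
    """Return a verbal description, or None if the table is ignored."""
    kind = classify_table(rows)
    if kind == "graphics":
        return None
    if kind == "alignment":
        items = [cell.strip() for row in rows for cell in row if cell.strip()]
        return "a list of %d items" % len(items)
    columns = max(len(row) for row in rows)
    return "a table with %d rows and %d columns" % (len(rows), columns)
```

A production classifier would likely also weigh borders, header cells and nesting depth, none of which are modeled in this sketch.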
[0038] Form description depends on the relative size 
of the form within a page. One single-entry form may be 
handled in the manner described above. A larger form 
that appears to be only part of a page may be an- 
nounced as such but is generally accessed as its ele- 
ments appear in the reading. Direct navigation is possi- 
ble based on form number and element number. Finally, 
a page that is mostly a form is treated as a form rather 
than a page. An attempt is made to locate the name of 
each entry to aid in description and direct navigation. 
Also note that a section, subsection or other localized 
form within a page may be treated in a similar manner. 
This introduces modal processing where once the form 
is "entered," then navigation is form-based rather than 
paragraph or section based, until the form is "exited,"
i.e., submitted or skipped.

2.2.2 Content Description 

[0039] Page content is described to the extent possi- 
ble by using IVR platform 102 to synthesize text on the 
page and describe the known content of images, tables, 
forms and other structure. More specifically, designated 
types of speech may be generated for each of the vari- 
ous HTML elements, e.g., for hyperlink titles, bold text, 
form field labels, and other elements useful in navigation.
The designated types of speech can be user-defined.

2.3 Web Page Analysis 

[0040] In accordance with the invention, web page 
analysis carried out in the IVR platform 102 attempts to 
fit a given web page to one of several predefined page 
models, with a default top-down strategy used for pages 
that do not fit. The objective is to maximize user com- 
prehension of pages by designing models that have an 
easy-to-remember structure, i.e., we want to prevent a 
user from getting lost and make it easy to locate relevant 
parts of a page. For this reason the models may be 
made inherently simple and mostly sequential with min- 
imum hierarchy. Analysis consists of the two steps of 
identifying the best model and then fitting the page con- 
tent to the model parts. Navigation options may then be 
partly controlled by the model. This should simplify use 
for experienced users because the model can be an- 
nounced, thereby signaling the optimum navigation 
strategy. 
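
The first analysis step described above, choosing the best-fitting model and falling back to the default top-down strategy, might be sketched as follows. This is a minimal illustration only: the `PageModel` class, the two scoring functions, the tag-count representation of a page, and the 0.5 threshold are all hypothetical and are not taken from the illustrative embodiment. The second step, fitting page content to the model parts, is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical page summary: element-type name -> count on the page.
Page = Dict[str, int]


@dataclass
class PageModel:
    name: str
    score: Callable[[Page], float]  # assumed fit score in the range [0, 1]


def form_score(page: Page) -> float:
    # Fraction of page elements that are form fields.
    total = sum(page.values())
    return page.get("form_fields", 0) / total if total else 0.0


def list_score(page: Page) -> float:
    # Fraction of page elements that are list items.
    total = sum(page.values())
    return page.get("list_items", 0) / total if total else 0.0


MODELS: List[PageModel] = [
    PageModel("form", form_score),
    PageModel("list", list_score),
]

DEFAULT_MODEL = "top-down"
THRESHOLD = 0.5  # assumed cut-off for "consists mostly of"


def choose_model(page: Page) -> str:
    """Pick the best-scoring model, or the default if nothing fits well."""
    best = max(MODELS, key=lambda m: m.score(page))
    return best.name if best.score(page) > THRESHOLD else DEFAULT_MODEL
```

Under these assumptions, a page summarized as `{"form_fields": 6, "paragraphs": 1}` is assigned the form model, while a page with only paragraphs falls through to the top-down default.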

[0041] In the illustrative embodiment, three levels of 
models are used: frame, page and section. This is be- 
cause pages can change within otherwise constant 
frames. We want to model the frame layout separately
because it can remain constant, so use of the frame
model can simplify navigation. In general, most section
models may be implemented as page models applied
to a single section. The following is an exemplary set of
frame models:

1. Single frame or no frames. In this case, no men- 
tion of frames is made, simply that there is "a page." 

2. Main page and auxiliary. There is a single main 
frame for the page and surrounding frames for con- 
stant material, such as a header, index bar or 
search form. The example given above fits this 
model. 

3. Split-screen. This means that the multiple frames 
are all logically part of the same page, which simply 
permits different areas to be visible at the same time 
while others are scrolled. The difference is that 
some of the frames are intended to remain constant 
while others switch page contents. Note that iden- 
tifying this model can be difficult without an embed- 
ded hint. 

4. Multi-page. This is a catch-all model for all multi- 
frame layouts that do not fit any other model. In this 
case, it is not clear whether the frames remain re- 
lated or which are more constant than others. An 
example would be two frames that each take half of 
the total screen, without any embedded hint that 
one of the other models fits. 

[0042] Each page within a frame set is then matched 
against a set of page models, although the specified 
frame model can imply that certain frames contain cer- 
tain types of pages. The following is an exemplary set 
of page models: 

1. Title area. This model applies only to a page in a
title-area frame. No navigation except top to bottom 
reading applies. Links and limited forms are permis- 
sible. 

2. Index area. This model applies to a frame of index 
links. It is treated as a list, or a set of lists if headers 
are apparent. Navigation is top to bottom or to a 
header. A simple form is permissible, which can be 
directly navigated to. 

3. Form. This model indicates that the entire page 
consists mostly of a form. All navigation is custom- 
ized for forms. This can be a main or auxiliary page, 
and also applies to sections. 

4. Plain page. The page has no detectable structure 
beyond paragraphs, if even that. Reading is top to 
bottom with paragraph navigation. This also applies 
to sections. 

5. List. The page consists mostly of a list. Also per- 
missible is header and trailer material. Note that the 
list can be made of structures besides an <OL> or 
<UL>, such as tables. This also applies to sections 
or isolated lists. 

6. Table. The page consists mostly of a true table,
plus optional header and trailer material. The table
structure is described in terms of rows, columns and
headers, and navigation based on this structure is
available, e.g., "read row two." This also applies to
sections or isolated tables.

7. Image. This means that the page is mostly an
image, possibly with a caption or title. This implies
that it is apparently not really just a list or bitmap
form. This also applies to sections or isolated images.

8. Slide table. This is a list of images, possibly
two-dimensional, optionally with captions. A two-dimensional
list with apparent row and column headers is
a table whose contents are images, whereas without
these headers it is a slide table. Note that an
apparent slide table may really be a command list
where bitmaps are used instead of text, although
this is a difficult distinction to make.

9. Sectioned page. This model indicates that the
page is broken into a number of top-level sections
by a set of <H1> or other entries. Navigation to
individual sections is supported, and the section
header list can be requested. This is also carried
out to one additional subsection level. Subsections
are only available within the current section.

10. Multi-sectioned page. This is a special case of
the sectioned page where there are more than two
levels but there is a strict hierarchical numbering
scheme, such as "Section 1.A.4." These section
numbers are used for navigation and are globally
available. The headers are also available within the
active section tree. The difference with the sectioned
page is that without the strict numbering, sectioning
is not done past two levels due to the probability
of confusion.

[0043] It should be emphasized that the frame, page
and section models described above are examples only,
and a subset of these models, as well as combinations
of these and other models, may be used in a given
embodiment of the invention.



2.3.1 Images and Text

[0044] In the illustrative embodiment, paragraphs are
generally read top to bottom, with repeat and skip commands
being available for navigation. Paragraphs in a
section can be optionally numbered for quick navigation.
Almost any non-text item will start a new paragraph. The
main embedded items are links, font changes and images.
[...] around them, but are considered separate paragraphs
if they stand alone on a given "line" of the page. Embedded
links may be read in a different voice. Font changes
are normally ignored, but user preferences can be set
to assign different voices to them. A paragraph with
embedded images may be announced as such before any
of its textual content is read. Images can be described,
e.g., by caption, and the request of a particular image
may be done by number, with numbering done in row-major
order. Typically, no mention of these images is
made while reading the text. Isolated images, e.g.,
image-only paragraphs or table elements, may be described,
e.g., as "an image captioned ..." and possibly
with the size announced.

2.3.2 Tables

[0045] In accordance with the invention, tables are
analyzed to classify their purpose. Tables with a single
element are generally ignored and their element used
without regard to the table. Tables with row and/or column
[...] described and navigated as such. All other tables are
examined for a fit to various models. An exemplary set
of table models may be as follows: A table with two
elements, one of which is an image, is taken to be an
image and title combination. This becomes an image and
the table itself is ignored. A table whose elements are
mostly form elements is taken to be a form. The table
structure is used to associate titles with elements and
to establish next/previous relationships but is otherwise
not mentioned to the user. A table whose elements are
plain text or links is taken as a list.
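
The exemplary table models can be read as an ordered series of checks. The sketch below is one possible rendering under stated assumptions: the `Cell` and `Table` types are hypothetical, "mostly" is taken to mean more than half of the elements, and a table with detected row or column headers is assumed to be the data-table case that is "described and navigated as such."

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    kind: str  # hypothetical element kinds: "text", "link", "image", "form"


@dataclass
class Table:
    cells: List[Cell]
    has_headers: bool = False  # row and/or column headers were detected


def classify_table(table: Table) -> str:
    """Fit a table to one of the exemplary models, checked in text order."""
    kinds = [cell.kind for cell in table.cells]
    if len(kinds) <= 1:
        return "ignored"            # single element: use it, drop the table
    if table.has_headers:
        return "data table"         # assumption: headers imply a data table
    if len(kinds) == 2 and kinds.count("image") == 1:
        return "image with title"   # the table itself is ignored
    if kinds.count("form") * 2 > len(kinds):
        return "form"               # assumption: "mostly" means over half
    if all(kind in ("text", "link") for kind in kinds):
        return "list"
    return "unclassified"           # no exemplary model fits
```

For example, `classify_table(Table([Cell("image"), Cell("text")]))` yields `"image with title"` under these rules.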



2.3.3 Forms 

[0046] In the illustrative embodiment, forms may be
classified as either "embedded" or "plain." An embedded
form with a single element or other type of small
form may be viewed as an entry area, e.g., a search
[...]. These types of forms can be treated as top-level
items, e.g., search, or as plain paragraphs, e.g., a "give
us your comments" element at the end of a page. All
other forms are treated as plain forms. The main point
of the form analysis is to enable description and
form-specific navigation. We generally want to classify
all elements in a form as to whether they [...] live" or
are a title, instructions, etc. associated with a
particular element. We also want to [...] previous/next
relationships. Note that [...] before or after a form can
be [...] as a title or notes. The analysis in the illustrative
[...] inside or close to the <FORM> and </FORM> pair
even though form elements can be located throughout
the page. The analysis attempts to make use of adjacency
in the HTML source, or in corresponding tables.
Note that a table with headers that contain [...]
and regular form entries may be considered a form with
table navigation added, whereas a table with only a few
entries might instead be described as a table with
incidental form elements.

2.4 Automatic Grammar Creation

[0047] As noted above, the grammar generator 120
in IVR platform 102 generates speech grammars from
hyperlink titles and other web page information. This
grammar generation may involve, for example, creating
a Grammar Specification Language (GSL) description
of each possible subset of the title words. The resulting
GSL is compiled and optimized for the speech recognizer
122. In addition, the vocabulary words used in the
grammar are phonetically transcribed using the TTS
synthesizer 116. Additional details regarding GSL can
be found in, for example, M.K. Brown and J.G. Wilpon,
"A Grammar Compiler for Connected Speech Recognition,"
IEEE Transactions on Signal Processing, Vol. 39,
No. 1, pp. 17-28, January 1991, which is incorporated
by reference herein.

2.4.1 Combinatorics 

[0048] Flexibility may be added to the voice navigation
commands through the use of combinatoric
processing, e.g., computing all 2^n - 1 possible combinations
of the title words (for a title of n words), while
keeping the words in order. This process provides a
tightly constrained grammar with low perplexity that
allows all possible word deletions to be spoken, thereby
giving the user freedom to speak only the smallest set
of words necessary, e.g., to address a given hyperlink.
The process can also create many redundancies in the
resulting GSL description, because leading and trailing
words are reused in many subsets. The redundancy may
be removed when the grammars are determinized, as
will be described below. Small word insertions may be
allowed by inserting so-called acoustic "garbage" models
between words in the hyperlink title subsets. This can be
done automatically by the grammar generator 120. The
combinatoric processing may be inhibited when <GRAMMAR>
definitions are encountered. A mixture of hyperlink titles
and <GRAMMAR> definitions can be used on a single page
to take advantage of the features of each method.
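
As a small illustration of the combinatoric processing, the sketch below enumerates every non-empty subset of a hyperlink title's words with the word order preserved; a title of n distinct words yields 2^n - 1 phrases. The function name `title_subsets` is hypothetical, and the output is a plain list of phrases rather than a GSL description.

```python
from itertools import combinations
from typing import List


def title_subsets(title: str) -> List[str]:
    """All non-empty subsets of the title words, keeping the words in order."""
    words = title.split()
    phrases = []
    for size in range(1, len(words) + 1):
        # combinations() emits tuples in the order the words appear in the title.
        for picked in combinations(words, size):
            phrases.append(" ".join(picked))
    return phrases


phrases = title_subsets("Get voice messages")
print(len(phrases))  # 7, i.e. 2**3 - 1
print(phrases)
```

Leading and trailing words recur across many of the generated phrases, which is the redundancy the text says is later removed by determinization. If a title repeats a word, some generated phrases will be duplicates.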

2.4.2 Grammar Compilation 

[0049] In the illustrative embodiment, grammar compilation
generally involves the steps of preprocessing
the created GSL to include external files, expanding
macros, parsing the expanded GSL and generating
grammar network code. The grammar code describes
grammar rules that define how states of a finite-state
network are connected and what labels are attached to
the state transitions. For additional details, see M.K.
Brown and B.M. Buntschuh, "A Context-Free Grammar
Compiler for Speech Understanding Systems," ICSLP
'94, Vol. 1, pp. 21-24, Yokohama, Japan, Sept. 1994,
which is incorporated by reference herein. The resulting
finite-state network is typically large and redundant,
especially if most of the GSL is created from hyperlink
titles, making the grammar inefficient for speech
recognition. In accordance with the invention, this
inefficiency may be reduced in four stages of code optimization.
[0050] The first stage involves determinizing the 
grammar using the well-known finite-state network de- 
terminization algorithm. This eliminates all LHS redun- 
dancy in the grammar rules making the resulting net- 
work deterministic in the sense that, given an input sym- 
bol, the next state is uniquely defined. All grammar am- 
biguity is removed in this stage. The second stage of 
optimization minimizes the number of states in the net-
work using the O(n log n) group partitioning algorithm.
This eliminates all homomorphic redundancy while pre- 
serving determinism. This is the state-minimal descrip- 
tion of the grammar, but is not necessarily the most ef- 
ficient representation for speech recognition. The third 
stage of optimization removes all RHS grammar rule re- 
dundancy. This operation does not preserve determin- 
ism, but does eliminate redundant state transitions. 
Since state transitions carry the word labels that repre- 
sent word models and therefore cause computation, re- 
ducing redundancy in these transitions is beneficial 
even though the number of states is usually increased 
in the process. The last stage of optimization is the re- 
moval of most null, i.e., "epsilon," state transitions. 
Some of these null transitions are created in the third 
stage of optimization. Others can be explicitly created 
by a <GRAMMAR> definition. While null transitions do 
not cost computation, they waste storage and therefore 
should be eliminated. 
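
The first optimization stage can be illustrated with the textbook subset construction. The sketch below is an assumption-laden illustration, not the compiler's actual code: the network is represented as a hypothetical dictionary from (state, word) pairs to sets of next states, null transitions are not handled, and the later minimization, RHS-redundancy and null-removal stages are not shown.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical network encoding: (state, word) -> set of possible next states.
NFA = Dict[Tuple[int, str], Set[int]]
DState = FrozenSet[int]


def determinize(nfa: NFA, start: int, finals: Set[int]):
    """Subset construction: each deterministic state is a set of NFA states."""
    start_set: DState = frozenset([start])
    dfa: Dict[Tuple[DState, str], DState] = {}
    dfa_finals: Set[DState] = set()
    pending = [start_set]
    seen = {start_set}
    while pending:
        current = pending.pop()
        if current & finals:
            dfa_finals.add(current)
        # Merge the outgoing transitions of all member states by word label.
        by_word: Dict[str, Set[int]] = {}
        for (state, word), targets in nfa.items():
            if state in current:
                by_word.setdefault(word, set()).update(targets)
        for word, targets in by_word.items():
            nxt = frozenset(targets)
            dfa[(current, word)] = nxt  # exactly one next state per word
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return dfa, start_set, dfa_finals


def accepts(dfa, start_set, dfa_finals, phrase: str) -> bool:
    """Follow the unique transition for each word of the phrase."""
    state = start_set
    for word in phrase.split():
        state = dfa.get((state, word))
        if state is None:
            return False
    return state in dfa_finals


# Two rules, "get messages" and "get mail", that share the leading word "get".
nfa = {(0, "get"): {1, 3}, (1, "messages"): {2}, (3, "mail"): {4}}
dfa, start_set, dfa_finals = determinize(nfa, 0, {2, 4})
print(accepts(dfa, start_set, dfa_finals, "get mail"))  # True
```

In the example, the two "get" transitions leaving the start state are merged into one, so that after hearing "get" the next state is uniquely defined.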

[0051] It should be noted that in alternative embodi- 
ments of the invention, grammars may be partially or 
completely precompiled rather than compiled as the 
grammars are used. Such an arrangement may be ben- 
eficial for applications in which, for example, the gram- 
mars are very large, such as name dialing directories, 
or would otherwise require a long time for compilation. 

2.4.3 Phonetic Transcription 

[0052] The above-noted vocabulary words are ex- 
tracted from the grammar definitions during the compi- 
lation process. For example, each word may be proc- 
essed in isolation by a pronunciation module in the TTS 
synthesizer 116 to create phonetic transcriptions that 
describe how each word is pronounced. This method 
has the disadvantage of ignoring context and possibly 
mispronouncing a word as a noun instead of a verb or 
vice versa, e.g., object, subject, etc. Context information 
may be included in order to provide more accurate pro- 
nunciation. 

2.5 Voice interpretation 

[0053] In the illustrative embodiment, voice com- 
mands may be interpreted rapidly by using hash tables 
keyed on the spoken phrases. This is typically a
"many-to-many" mapping from speech recognizer output text
to computer commands or URLs. If more than one URL
or command is retrieved from the table, a disambiguation
dialog manager may be utilized to direct the
user to make a unique selection. Separate hash tables
can be maintained for each web page visited, so that
[...] a page. This can lead to the creation of many hash
tables, but the table size is typically small, thus making
this an effective method for web page browsing. For
[...] interpret the phrase.
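
A per-page hash table of this kind might look like the following sketch. The function names, the `NOT_RECOGNIZED` and `DISAMBIGUATE` return conventions, and the `/voice` and `/email` paths are hypothetical; a real disambiguation dialog manager would prompt the user rather than return a string.

```python
from collections import defaultdict
from typing import Dict, List

# One table per visited page: page URL -> (spoken phrase -> list of targets).
tables: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))


def register(page: str, phrase: str, target: str) -> None:
    """Map a recognizable phrase to a URL or command for the given page."""
    tables[page][phrase.lower()].append(target)


def interpret(page: str, recognized: str) -> str:
    """Look up recognizer output text; defer to disambiguation if ambiguous."""
    targets = tables[page].get(recognized.lower(), [])
    if not targets:
        return "NOT_RECOGNIZED"
    if len(targets) == 1:
        return targets[0]
    return "DISAMBIGUATE: " + " | ".join(targets)


page = "http://www.anywhere.net/"
register(page, "get voice", page + "voice")
register(page, "messages", page + "voice")  # one phrase, two targets
register(page, "messages", page + "email")
print(interpret(page, "get voice"))
print(interpret(page, "messages"))
```

The mapping is many-to-many in both directions: "get voice" and "messages" can both reach the voice page, and "messages" alone reaches two pages and therefore triggers disambiguation.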

3. GENERAL WEB-BASED IVR APPLICATIONS 



[0054] The IVR platform 102 in accordance with the
invention not only provides a speech-controlled web
browser, but can also be used to allow the general
Internet population to build IVR applications. The
advantage of this approach is the elimination of the need for
[...] IVR equipment. As previously noted, typical IVR
platforms [...] not need to make any large investment in
equipment. [...]

[0055] As noted previously, each ordinary hyperlink title
in a given page or set of pages may be processed to
[...] grammars that allow all spoken subsequences
for the words in the title. For general IVR applications
[...]. Macros can be defined either within the local
[...] a server or a client. In a typical IVR platform, a
[...] Java code might be used to perform operations
at the server that could, in turn, control remote devices
through the Internet or the PSTN using additional
hardware at the remote end. Since HTML pages on the
Internet form an implicit finite-state network, [...] network
can be used to create a dialog system. The resulting
[...] to the user. Even without an applet language,
such a dialog system can be built using the techniques
of the invention.

More specifically, an IVR web page implementing
such a dialog system may include, e.g., possibly
null text to be spoken to the user when the page is read,
a program script that would execute operations on the
host processor, and a possibly silent hyperlink for each
[...] <GRAMMAR> tag embedded in a hyperlink (e.g., HREF
= "http://www.anywhere.net/" [...] (get | retrieve |
call for) messages)" TITLE = "Get messages")
[...] a flexible set of alternative [...]
a user can say to cause an action such as making a
[...] the user. The user can respond with "get messages,"
"retrieve messages," or "call for messages" in this
simple example. By speaking a command and following this
link to the next web page, the user may then be read
[...] want voice or email messages [...] with appropriate
[...] to cause access to voice messages or email. A third
default link might be taken when the utterance is not
understood [...] to return a token to indicate
non-recognition. For each of the message choices there may
be a further set of web pages to deal with functions such
as reading, saving [...] deleting messages and responding
to messages. [...] tag embedded in a hyperlink is
HREF = "http://www.anywhere.net/" GRAMMAR_FILE = <URL>.
In this case the specified URL indicates where the
grammar file can be found. Many other types of [...]
constructed in a similar manner using the techniques
[...]
[0059] The ability to build dialog systems in this manner
[...] ISP with an IVR platform system will be able to sell IVR
support services to the general public at relatively low 
cost. Corporations with more demanding response re- 
quirements may ultimately want to operate their own 
platforms for a limited community of employees, but can 
develop and test their IVR web pages before committing 
to purchase costly equipment. 
[0060] The above-described embodiments of the in- 
vention are intended to be illustrative only. Alternative 
embodiments may incorporate additional features such 
as, for example, Optical Character Recognition (OCR) 
for generating audible information from retrieved web 
pages, analysis of images for verbal rendering, e-mail 
to speech conversion, and speaker verification for se- 
cure access. These and numerous other alternative em- 
bodiments within the scope of the following claims will 
be apparent to those skilled in the art. 



Claims 

1. A method for implementing an interactive voice re- 
sponse application over a network, the method 
comprising the steps of: 

generating speech output characterizing at 
least a portion of a web page retrieved over the 
network; 

processing information in the web page to pro- 
duce at least a portion of at least one grammar; 
and 

utilizing the at least one grammar to recognize 
speech input. 

2. The method of claim 1 further including the step of 
determining which of a set of predetermined models 
best characterizes the retrieved web page. 

3. The method of claim 2 further including the step of 
utilizing a default top-down description process if 
the retrieved web page is not adequately character- 
ized by any of the predetermined models. 

4. The method of claim 2 further including the step of 
applying a plurality of different sets of models to the 
retrieved web page, each of the sets including at 
least one model. 

5. The method of claim 2 wherein the models charac- 
terize structure in the web page including at least 
one of a section heading, a table, a frame, and a 
form. 

6. The method of any of the preceding claims further 
including the step of utilizing the grammar to create 
phoneme information, such that similar phonemes 
are used in both the recognizing and generating 
steps. 



7. The method of any of the preceding claims wherein
the generating, processing and utilizing steps include
implementing a dialog system in which a dialog
is conducted with a user in order to control the
output of the web page information to the user.

8. The method of claim 7 wherein the web page includes
at least one of (i) text to be read to the user,
(ii) a program script for executing operations on a
host processor, and (iii) a hyperlink for each of a set
of designated spoken responses which may be received
from the user.

9. The method of claim 7 or claim 8 wherein the web
page includes at least one hyperlink that is to be
utilized when a given spoken user input is rejected
as unrecognizable.

10. The method of any of the preceding claims wherein
at least a portion of the grammar produced in the
utilizing step is precompiled.

11. A machine-readable medium storing one or more
programs for implementing an interactive voice response
application over a network, wherein the one
or more programs when executed by a machine
cause the machine to carry out the steps of a method
as claimed in any of the preceding claims.

12. An apparatus for implementing an interactive voice
response application over a network, the apparatus
comprising means for carrying out each step of a
method as claimed in any of claims 1 to 10.

13. The apparatus of claim 12 wherein the apparatus
includes a processor operative to implement at
least one of the speech output generating step, the
grammar producing step and the speech recognising
step.

14. The apparatus of claim 12 or claim 13 further including
a parser which identifies textual information in
the retrieved web page, the grammar producing
step being responsive to said identified textual
information.

15. The apparatus of any of claims 12 to 14 wherein the
means for carrying out the speech output generating
step, the grammar producing step and the
speech recognising step are elements of an interactive
voice response system associated with a
service provider.

16. The apparatus of any of claims 12 to 14 wherein the
means for carrying out the speech synthesizing
step operates in a description mode, in which, unless
interrupted by user input, it provides a complete
description of the retrieved web page to a user via
an audio interface device, and an inspection mode
in which it provides an abbreviated description of
the retrieved web page and then awaits inspection
command input from the user.